AITopics | small learning rate

incorporate feedback into our final revision. 4 [R1]: "I don't exactly see if small batch vs large batch captures this phenomenon; if yes should say explicitly."

Neural Information Processing Systems Feb-13-2026, 21:02:44 GMT

We thank the reviewers for the detailed and insightful reviews. As the reviews noted, our work 1) introduces "novel ... Smith et al. [2017] make an explicit connection between small vs. large batch ... "A small discussion on if the phenomenon has been observed for different datasets/tasks with different optimizers": The phenomenon may not be true for other optimizers such as Adam, though. "concept of 'memorizable and generalizable', though intuitive, is sketchy and not formally explained ... authors ...": We acknowledge that the terms "memorizable" and "generalizable" are potentially confusing. We will revise our terminology to clarify this distinction. By "inherently noisy", we refer to the fact that high noise in the datapoints will necessitate larger sample complexity.

artificial intelligence, machine learning, noise, (16 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning (1.00)

Analyzing Lottery Ticket Hypothesis from PAC-Bayesian Theory Perspective

Neural Information Processing Systems Feb-11-2026, 21:18:21 GMT

However, since the initial large learning rate generally helps the optimizer to converge to flatter minima, we hypothesize that the winning tickets have relatively sharp minima, which is considered a disadvantage in terms of generalization ability.

artificial intelligence, machine learning, ticket, (18 more...)

Neural Information Processing Systems

Country:

Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.05)
Europe > France (0.04)

Genre: Contests & Prizes (0.40)

Industry: Leisure & Entertainment (0.54)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.30)

Preconditioning Matters: Fast Global Convergence of Non-convex Matrix Factorization via Scaled Gradient Descent

Neural Information Processing Systems Dec-27-2025, 04:37:32 GMT

Low-rank matrix factorization (LRMF) is a canonical problem in non-convex optimization: the objective function to be minimized is non-convex and even non-smooth, which makes global convergence guarantees for gradient-based algorithms quite challenging.
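
As a rough illustration of what "scaled gradient descent" in the title usually refers to, the sketch below applies a generic preconditioned update to the factors of M ≈ L Rᵀ, where each gradient is right-multiplied by the inverse Gram matrix of the other factor. It is a minimal sketch under stated assumptions and not necessarily the method analyzed in the paper: the smooth squared Frobenius loss, the step size of 0.5, the initialization near the true factors, and the names `scaled_gd`, `L0`, and `R0` are all hypothetical choices made for the example.

```python
import numpy as np


def scaled_gd(M, L0, R0, step=0.5, iters=100):
    """Minimize 0.5 * ||L @ R.T - M||_F^2 with preconditioned gradient steps."""
    L, R = L0.copy(), R0.copy()
    for _ in range(iters):
        E = L @ R.T - M                   # residual of the current factorization
        grad_L, grad_R = E @ R, E.T @ L   # plain gradients w.r.t. L and R
        # Right-multiply each gradient by the inverse Gram matrix of the other
        # factor. The Gram matrices are symmetric, so a linear solve on the
        # transposed gradient avoids forming an explicit inverse.
        L_new = L - step * np.linalg.solve(R.T @ R, grad_L.T).T
        R_new = R - step * np.linalg.solve(L.T @ L, grad_R.T).T
        L, R = L_new, R_new
    return L, R


# Hypothetical demo: an exact rank-3 matrix, started close to the true factors
# (an assumption of this sketch; it does not demonstrate global convergence).
rng = np.random.default_rng(0)
L_true = rng.standard_normal((30, 3))
R_true = rng.standard_normal((20, 3))
M = L_true @ R_true.T
L0 = L_true + 0.01 * rng.standard_normal(L_true.shape)
R0 = R_true + 0.01 * rng.standard_normal(R_true.shape)
L, R = scaled_gd(M, L0, R0)
rel_err = np.linalg.norm(L @ R.T - M) / np.linalg.norm(M)
print(f"relative error: {rel_err:.2e}")
```

The preconditioner makes the update insensitive to how the scale is split between the two factors, which is the usual motivation for preferring it over plain gradient descent on ill-conditioned targets.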

fast global convergence, non-convex matrix factorization, varepsilon, (9 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning (0.95)

Topological Invariance and Breakdown in Learning

Yang, Yongyi, Poggio, Tomaso, Chuang, Isaac, Ziyin, Liu

arXiv.org Artificial Intelligence Oct-6-2025

We prove that for a broad class of permutation-equivariant learning rules (including SGD, Adam, and others), the training process induces a bi-Lipschitz mapping between neurons and strongly constrains the topology of the neuron distribution during training. This result reveals a qualitative difference between small and large learning rates $η$. With a learning rate below a topological critical point $η^*$, the training is constrained to preserve all topological structure of the neurons. In contrast, above $η^*$, the learning process allows for topological simplification, making the neuron manifold progressively coarser and thereby reducing the model's expressivity. Viewed in combination with the recent discovery of the edge of stability phenomenon, the learning dynamics of neural networks under gradient descent can be divided into two phases: first they undergo smooth optimization under topological constraints, and then enter a second phase where they learn through drastic topological simplifications. A key feature of our theory is that it is independent of specific architectures or loss functions, enabling the universal application of topological methods to the study of deep learning.
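
The idea of a learning-rate threshold separating two regimes can be illustrated with the classical stability bound that underlies the edge-of-stability phenomenon mentioned above. On a one-dimensional quadratic with curvature λ, gradient descent contracts only when η < 2/λ. The sketch below is an assumption-laden toy and does not compute the topological critical point $η^*$ of the abstract; the function name `gd_on_quadratic` and the chosen curvature and learning rates are hypothetical.

```python
def gd_on_quadratic(curvature, lr, x0=1.0, steps=50):
    """Run gradient descent on f(x) = 0.5 * curvature * x**2 and return the final x."""
    x = x0
    for _ in range(steps):
        x -= lr * curvature * x   # gradient of f at x is curvature * x
    return x


curvature = 4.0                   # stability threshold is 2 / curvature = 0.5
below = gd_on_quadratic(curvature, lr=0.4)   # each step multiplies x by -0.6
above = gd_on_quadratic(curvature, lr=0.6)   # each step multiplies x by -1.4
print(f"lr=0.4 -> |x| = {abs(below):.3e}, lr=0.6 -> |x| = {abs(above):.3e}")
```

Below the threshold the iterate shrinks geometrically toward the minimizer, while above it the iterate oscillates with growing magnitude.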

artificial intelligence, machine learning, neural network, (15 more...)

arXiv.org Artificial Intelligence

2510.0267

Country: North America > United States (0.67)

Genre: Research Report > New Finding (0.88)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.89)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.88)

Towards Explaining the Regularization Effect of Initial Large Learning Rate in Training Neural Networks

Yuanzhi Li, Colin Wei, Tengyu Ma

Neural Information Processing Systems Aug-20-2025, 00:46:33 GMT

Neural Information Processing Systems http://nips.cc/

arxiv preprint arxiv, initial learning rate, learning rate, (14 more...)

Neural Information Processing Systems

Country:

North America > Canada > Ontario > Toronto (0.14)
North America > United States > California > Santa Clara County > Palo Alto (0.04)
North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.04)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.31)
